Skip to content

fix: remove panic() from all reconciliation and CLI paths#477

Draft
flemzord wants to merge 29 commits into
mainfrom
fix/remove-panics
Draft

fix: remove panic() from all reconciliation and CLI paths#477
flemzord wants to merge 29 commits into
mainfrom
fix/remove-panics

Conversation

@flemzord

@flemzord flemzord commented Jun 9, 2026

Copy link
Copy Markdown
Member

Why

The project review (see review.md, finding C2/C3 and friends) identified 31 panic() call sites in production code. In a controller-runtime operator, any panic in a reconcile path, watch event handler, or shared helper crashes the whole controller-manager: a single malformed resource or transient API-server error takes down reconciliation for every stack in the cluster.

This PR removes every production panic() — 30 sites. The only one left is in internal/tests/internal/bootstrap.go (test bootstrap, where failing fast is idiomatic and Ginkgo owns the process).

One commit per panic fix so each change can be reviewed and reverted independently (panics fixed by a single coherent change — e.g. the two embedded-FS reads inside one closure — share a commit).

Approach by panic class

Class Fix
Helper already in an error-returning context Propagate the error (return err, wrapped where useful)
Helper without an error return (HashFromConfigMaps, LowerCamelCaseKind, LowerCaseKind, GetPublisherEnvVars, HashFromHash) Signature changed to return (T, error); all callers updated to propagate through their existing error paths
Watch map functions / event handlers (cannot return errors) Log via log.FromContext(ctx) and drop the event; per-stack loops continue instead of dying so one bad stack doesn't lose the others' events
init()-time scheme/equality registration utilruntime.Must(...), the idiom already used in cmd/main.go
URI.UnmarshalJSON Returns the parse error so only the offending object's decode fails
Dead code (core.CopyDir, no callers) Removed
kubectl-stacks CLI Errors flow through cobra's normal stderr + exit-code path; the deferred unlock failure in upgrade is joined into the command error via errors.Join

Notable details

  • internal/resources/resourcereferences/init.go: the loop-invariant GVKForObject lookup was also hoisted out of the owner-references loop.
  • internal/resources/gatewayhttpapis/create.go: the kind was resolved twice for the same value; the second call now reuses the first result.
  • internal/resources/stacks/init.go: the condition-building closure now returns the conversion error after marking the module condition False ("Unable to read module status"), preserving the deferred condition write.
  • No behavior change for the happy path anywhere: hash computations, env-var construction and Caddy/benthos config generation produce byte-identical output.

Verification

  • go build ./... and go vet ./... pass on the main module and the tools/kubectl-stacks module (run after each commit for the touched package, plus a full pass at the end).
  • make test: all 8 Ginkgo envtest suites pass.

flemzord added 28 commits June 9, 2026 22:00
The ObjectMutator signature already propagates errors; panicking on
SetOwnerReference failure would crash the whole controller-manager
instead of failing the single reconciliation.
GetAllStackDependencies already returns an error; panicking on a
FromUnstructured conversion failure would crash the controller-manager
on a single malformed resource instead of failing one reconciliation.
GetAs already returns an error; panicking on json.Marshal failure
would crash the controller-manager instead of failing the single
settings lookup.
The CreateOrUpdate mutate closure already returns an error; panicking
would crash the controller-manager instead of failing the single
ResourceReference reconciliation.
The enclosing reconcile helper already returns an error; panicking on
a hashing failure would crash the controller-manager instead of
failing the single Benthos reconciliation.
UnmarshalJSON runs while decoding user-provided custom resources: a
single CR carrying an unparseable URL would crash the whole
controller-manager. Returning the error fails only the decode of the
offending object.
Registering the URI semantic-equality function can only fail on a
programmer error and must abort startup; utilruntime.Must is the
established idiom for this (already used in cmd/main.go) and reports
the failure through the k8s error handlers instead of a bare panic.
Same idiom as cmd/main.go: scheme registration failure is a
programmer error that must abort startup; utilruntime.Must reports it
through the k8s error handlers instead of a bare panic.
CopyDir has no callers outside its own recursion and contained two
panic() calls on filesystem errors. Removing it eliminates both panic
sites.
…s in WithWatchVersions

A transient API-server error while listing stacks in the Versions
watch handler would crash the whole controller-manager. Log the error
and drop the event instead: the next Versions event re-triggers the
mapping.
… in WithWatchVersions

ObjectKinds failing for the watch target is a registration problem;
crashing the controller-manager from an event handler hides the root
cause. Log the error and drop the event instead.
…thWatchVersions

A transient list error for one stack would crash the whole
controller-manager and lose the events of every other stack. Log the
error and continue with the remaining stacks.
…Func

Same rationale as the other WithWatchVersions handlers: an event
handler must not crash the controller-manager on a kind-resolution
error.
A kind-resolution failure (unregistered type) would crash the whole
controller-manager from any module deployment helper. The function now
returns the error; the eleven callers propagate it through their
existing error paths, and the databases watch handler logs it and
drops the event.
Same rationale as LowerCamelCaseKind: kind-resolution failure must not
crash the controller-manager. Also reuse the already-computed object
name in gatewayhttpapis.Create instead of resolving the kind twice.
An unknown broker mode in a Broker status would crash the whole
controller-manager from any module deployment helper. The function now
returns an error that the five callers propagate through their
existing error paths, failing only the offending reconciliation.
Converting a module's unstructured content to read its status would
panic and crash the controller-manager on a single malformed module.
The condition-building closure now returns the error (after marking
the condition false) and the reconciliation fails normally.
…atchResource

A kind-resolution error in the watch map function would crash the
whole controller-manager. Log the error and drop the event instead.
Also hoist the loop-invariant GVK lookup out of the owner-references
loop.
createDeployment already has an error path; a hashing failure now
fails the single Auth reconciliation instead of crashing the
controller-manager.
A hashing failure would crash the whole controller-manager from any
of the three callers (benthosstreams, auths, caddy). The function now
returns the error and callers propagate it through their existing
error paths.
…king

The templates map was built in an immediately-invoked closure that
panicked on embedded-FS read errors, crashing the controller-manager.
Extract it into a buildTemplates helper that returns the error so the
single Benthos reconciliation fails instead.
The function already returns an error; a CLI must report failures
through its normal error path, not a panic stack trace.
…king

Same rationale as enable: report failures through the command's
normal error path.
…cking

Same rationale as enable: report failures through the command's
normal error path.
…icking

Same rationale as enable: report failures through the command's
normal error path.
…anicking

Same rationale as enable: report failures through the command's
normal error path.
… panicking

A failure to unlock stacks after an upgrade is exactly the situation
the user must be told about cleanly; join it with the command error
instead of panicking past the cobra error handling.
Register the scheme in main() and exit with a printed error instead
of panicking from init(), keeping all CLI failures on the same
stderr+exit-code path.
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5b3f65d0-1069-40f1-b362-35ff74be8244

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/remove-panics

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant